[client] Replace WG interface monitor polling with netlink subscription on Linux#5857
[client] Replace WG interface monitor polling with netlink subscription on Linux#5857alexsavio wants to merge 6 commits intonetbirdio:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughStart now delegates interface monitoring to platform-specific watchInterface(ctx, ifaceName, expectedIndex). The inline 2s polling loop and Changes
Sequence Diagram(s)sequenceDiagram
participant Monitor as WGIfaceMonitor.Start
participant Watch as watchInterface
participant Kernel as Kernel/Netlink
participant Context as ctx
Monitor->>Watch: start(ctx, ifaceName, expectedIndex)
alt linux
Watch->>Kernel: netlink.LinkSubscribe(RTNLGRP_LINK)
Kernel-->>Watch: events (RTM_NEWLINK / RTM_DELLINK)
Watch->>Watch: inspectLinkEvent(update, ifaceName, expectedIndex)
else non-linux
Watch->>Watch: poll getInterfaceIndex() every 2s
Watch-->>Watch: compare index vs expectedIndex
end
alt deletion or recreation detected
Watch-->>Monitor: return (restart=true, err?)
else ctx done
Context-->>Watch: done
Watch-->>Monitor: return (restart=false, ctx.Err())
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR reduces long-running memory/GC pressure on Linux by replacing the WireGuard interface monitor’s periodic net.InterfaceByName polling with an event-driven netlink subscription, while keeping the existing polling approach on non-Linux platforms.
Changes:
- Refactors
WGIfaceMonitor.Startto delegate interface monitoring to a platform-specificwatchInterface. - Adds a Linux implementation using
netlink.LinkSubscribeto react toRTM_NEWLINK/RTM_DELLINKevents. - Adds a non-Linux fallback implementation that retains the existing 2s polling behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
client/internal/wg_iface_monitor.go |
Removes inline polling loop and dispatches to watchInterface; updates docs to describe Linux vs non-Linux behavior. |
client/internal/wg_iface_monitor_linux.go |
New Linux netlink-based watcher to eliminate periodic link-table dumps and allocation churn. |
client/internal/wg_iface_monitor_other.go |
New non-Linux polling watcher preserving previous behavior on other platforms. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| case syscall.RTM_NEWLINK: | ||
| // Recreation: a new link with our name appears at a | ||
| // different index. Same name + same index is just a | ||
| // flag/state change on the existing interface — ignore. | ||
| if eventName == ifaceName && eventIndex != expectedIndex { | ||
| log.Infof("Interface monitor: %s recreated (index changed from %d to %d), restarting engine", | ||
| ifaceName, expectedIndex, eventIndex) | ||
| return true, nil | ||
| } |
There was a problem hiding this comment.
On Linux, interface renames are not detected. If the tracked interface is renamed away from ifaceName, the kernel typically emits an RTM_NEWLINK for the same ifindex with a different Attrs().Name; the previous polling implementation would treat this as deletion (InterfaceByName fails) and trigger a restart. Consider treating eventIndex == expectedIndex && eventName != ifaceName as a restart condition (rename/replacement), or otherwise re-checking by name when receiving RTM_NEWLINK for expectedIndex.
There was a problem hiding this comment.
Fixed in 764c7eb. RTM_NEWLINK with eventIndex == expectedIndex && eventName != ifaceName now triggers a restart, matching the behavior of the previous polling implementation.
|
|
||
| case update, ok := <-linkChan: | ||
| if !ok { | ||
| return false, fmt.Errorf("link subscription channel closed unexpectedly") |
There was a problem hiding this comment.
If the netlink subscription channel closes unexpectedly, this returns shouldRestart=false which means the engine will keep running but without any WG interface lifecycle monitoring (Engine.Start only logs a warning in that case). Consider either attempting to re-subscribe/fallback to polling, or returning shouldRestart=true so the engine can recover monitoring by restarting.
| return false, fmt.Errorf("link subscription channel closed unexpectedly") | |
| return true, fmt.Errorf("link subscription channel closed unexpectedly") |
There was a problem hiding this comment.
Fixed in 764c7eb. The closed-channel branch now returns shouldRestart=true and logs a warning so the engine re-establishes monitoring.
| func watchInterface(ctx context.Context, ifaceName string, expectedIndex int) (bool, error) { | ||
| done := make(chan struct{}) | ||
| defer close(done) | ||
|
|
||
| // Buffer the channel to absorb event bursts (e.g. when many veth | ||
| // pairs are created/destroyed at once by container runtimes). | ||
| linkChan := make(chan netlink.LinkUpdate, 32) | ||
| if err := netlink.LinkSubscribe(linkChan, done); err != nil { | ||
| return false, fmt.Errorf("subscribe to link updates: %w", err) | ||
| } | ||
|
|
||
| // Race window: the interface could have been deleted (or recreated) | ||
| // between the initial getInterfaceIndex() in Start and LinkSubscribe | ||
| // completing its handshake with the kernel. Re-check explicitly so we | ||
| // do not block forever waiting for an event that already fired. | ||
| if currentIndex, err := getInterfaceIndex(ifaceName); err != nil { | ||
| log.Infof("Interface monitor: %s deleted before subscription completed", ifaceName) | ||
| return true, fmt.Errorf("interface %s deleted: %w", ifaceName, err) | ||
| } else if currentIndex != expectedIndex { | ||
| log.Infof("Interface monitor: %s recreated (index changed from %d to %d) before subscription completed", | ||
| ifaceName, expectedIndex, currentIndex) | ||
| return true, nil | ||
| } |
There was a problem hiding this comment.
The new Linux event-driven watcher is complex enough (netlink subscription, race re-check, event filtering) that it would benefit from automated coverage. Consider adding unit tests by abstracting the netlink subscription behind an interface/function so tests can inject synthetic RTM_NEWLINK/RTM_DELLINK updates and assert the correct (shouldRestart, err) outcomes (including the ‘deleted before subscription completed’ and recreation cases).
There was a problem hiding this comment.
Acknowledged. Adding tests requires factoring the netlink subscription behind an injectable seam, which is a larger refactor than this PR is scoped for. Tracked as a follow-up I am happy to send separately.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/wg_iface_monitor_linux.go`:
- Around line 58-60: The handler that watches the netlink subscription currently
returns (false, error) when the "link subscription channel closed unexpectedly"
(the `if !ok { return false, fmt.Errorf(...) }` branch), which prevents the
monitor from being restarted; change this branch so the function signals the
supervisor to restart monitoring (i.e., return shouldRestart=true along with the
error) so the engine will re-establish the subscription and resume monitoring;
update the same function (the netlink/link-monitor routine where `ok` is
checked) to return the restart indication and optionally log the closure before
returning to aid debugging.
In `@client/internal/wg_iface_monitor_other.go`:
- Around line 16-18: The file-level comment in wg_iface_monitor_other.go
incorrectly lists android and ios as platforms where this cross-platform
fallback is used; update the docstring to remove or qualify android and ios (or
state “desktop OSes such as darwin, windows, freebsd”) so it doesn’t contradict
the early exit behavior in Start() (see Start() in wg_iface_monitor.go) which
prevents use on android/ios; ensure the comment clearly contrasts this fallback
with the Linux-specific RTNLGRP_LINK approach referenced in the same comment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: edd8ab0c-8119-4864-a529-23cd4459f52f
📒 Files selected for processing (3)
client/internal/wg_iface_monitor.goclient/internal/wg_iface_monitor_linux.goclient/internal/wg_iface_monitor_other.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
client/internal/wg_iface_monitor_linux.go (1)
55-55: Use%winstead of%vto properly wrap context errors for idiomatic error handling.At line 55, using
%wfollows Go's error wrapping conventions and enables downstream code to check forcontext.Canceled/context.DeadlineExceededviaerrors.Is(). The same pattern should be applied towg_iface_monitor_other.goline 39.♻️ Suggested patches
- return false, fmt.Errorf("wg interface monitor stopped: %v", ctx.Err()) + return false, fmt.Errorf("wg interface monitor stopped: %w", ctx.Err())Apply the same fix to
client/internal/wg_iface_monitor_other.goline 39.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/wg_iface_monitor_linux.go` at line 55, Replace the fmt.Errorf call that currently formats the context error with "%v" so it wraps the error using "%w" (i.e. change the string "wg interface monitor stopped: %v" to use "%w") in client/internal/wg_iface_monitor_linux.go (the return statement that returns fmt.Errorf("wg interface monitor stopped: %v", ctx.Err())) and make the same change in client/internal/wg_iface_monitor_other.go where the analogous fmt.Errorf call exists (line 39) so downstream code can use errors.Is to detect context.Canceled / context.DeadlineExceeded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@client/internal/wg_iface_monitor_linux.go`:
- Line 55: Replace the fmt.Errorf call that currently formats the context error
with "%v" so it wraps the error using "%w" (i.e. change the string "wg interface
monitor stopped: %v" to use "%w") in client/internal/wg_iface_monitor_linux.go
(the return statement that returns fmt.Errorf("wg interface monitor stopped:
%v", ctx.Err())) and make the same change in
client/internal/wg_iface_monitor_other.go where the analogous fmt.Errorf call
exists (line 39) so downstream code can use errors.Is to detect context.Canceled
/ context.DeadlineExceeded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1c8aff35-265e-4313-a985-fbea57d220dc
📒 Files selected for processing (1)
client/internal/wg_iface_monitor_linux.go
|
Addressed the |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/wg_iface_monitor_linux.go`:
- Around line 34-35: When netlink.LinkSubscribe(linkChan, done) fails we should
not permanently disable monitoring: wrap the netlink.LinkSubscribe call in a
short retry loop with exponential backoff (e.g., 3–5 attempts with increasing
sleeps) and return only on success; if all retries fail, start the platform’s
polling-based fallback monitor instead of returning error (so the monitor
goroutine stays active). Locate the netlink.LinkSubscribe invocation (with
linkChan and done) and implement the retry/backoff there, and on terminal
failure invoke the existing non-Linux polling monitor routine (reuse the
project’s polling implementation) so interface deletion/recreation detection
continues.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 57d738b9-e851-4c1a-8122-8db6fbc19c97
📒 Files selected for processing (2)
client/internal/wg_iface_monitor_linux.goclient/internal/wg_iface_monitor_other.go
🚧 Files skipped from review as they are similar to previous changes (1)
- client/internal/wg_iface_monitor_other.go
…on on Linux The WireGuard interface monitor introduced in netbirdio#4370 spawns a 2 s ticker that calls net.InterfaceByName(ifaceName) on every tick. On Linux, that function issues syscall.NetlinkRIB(RTM_GETLINK, ...) and dumps the entire kernel link table on every call, then linear-scans it for the matching name. On hosts with many veth interfaces (Docker, containerd, k8s) the per-call cost is dozens of KB and the ticker generates roughly 1 GB/day of allocation churn from this single source. On long-running clients the GC pressure plus span fragmentation manifests as a slow, monotonic RSS climb — observed at ~920 MB/day on a 4 GB Raspberry Pi running v0.68.1, eventually starving every other service on the host. This is one of the persistent leak sources behind netbirdio#3678 (still open across 0.65–0.68.x). A pprof allocs profile from a freshly-restarted 0.68.1 daemon shows (*WGIfaceMonitor).Start → getInterfaceIndex → net.InterfaceByName → syscall.NetlinkRIB as the single largest allocator at 513 MB / 29% of all allocations within ~20 minutes of uptime. Fix: split the watcher by build tag. * `wg_iface_monitor_linux.go` subscribes to RTNLGRP_LINK via netlink.LinkSubscribe (already a transitive dependency via `vishvananda/netlink`) and reacts to RTM_DELLINK / RTM_NEWLINK events for the tracked interface index. Allocations between events drop to zero. The same pattern is already used by `client/internal/networkmonitor/check_change_linux.go` for route events, so the dependency, idiom, and review surface are familiar. * `wg_iface_monitor_other.go` keeps the original 2 s polling loop for darwin / windows / freebsd / android / ios. No behavior change on those platforms — they do not exhibit the NetlinkRIB cost. * `wg_iface_monitor.go` keeps the shared `WGIfaceMonitor` type, the early-return checks (mobile / netstack / empty name), and `getInterfaceIndex`, then dispatches to the platform-specific `watchInterface`. A small race window between the initial `getInterfaceIndex` call in `Start` and `LinkSubscribe` completing its handshake is closed by re-checking the index after subscribing. Windows is also reported as affected in netbirdio#3678. The same event-driven treatment can be applied there using `NotifyIpInterfaceChange` from `iphlpapi`; left as a follow-up so this PR stays focused on the worst offender (Linux) with the smallest possible diff. Refs: netbirdio#3678
- linux: handle interface rename (RTM_NEWLINK with same index but different name). The previous polling implementation caught this implicitly via net.InterfaceByName returning "not found"; the event-driven version has to test it explicitly. - linux: return shouldRestart=true when the netlink subscription channel closes unexpectedly so the engine re-establishes monitoring instead of silently leaving the host unmonitored. Also log the closure at warn level. - other: tighten docstring to clarify that android/ios are compiled but never reached at runtime, since Start() exits early on mobile.
The rename-detection branch added in the previous commit pushed watchInterface over SonarCloud's cognitive-complexity limit (21 vs. 20 allowed) and made the RTM_NEWLINK case clause 11 lines long (rule limit: 10). Pull the entire switch into a small inspectLinkEvent helper. The main loop is now a thin receive-and-dispatch and the helper has its own self-contained complexity. No behavior change. Refs: SonarCloud go:S3776, go:S1151
The RTM_NEWLINK case clause in inspectLinkEvent was still 11 lines after the previous refactor (SonarCloud rule limit: 10). Pull both event handlers into their own helpers (inspectDelLink, inspectNewLink) so the case clauses become single-line dispatches. No behavior change. Refs: SonarCloud go:S1151
Use %w instead of %v in fmt.Errorf so downstream code can inspect the wrapped error with errors.Is(err, context.Canceled) and errors.Is(err, context.DeadlineExceeded). Applied to both wg_iface_monitor_linux.go and wg_iface_monitor_other.go.
Return shouldRestart=true when netlink.LinkSubscribe fails so the engine triggers a client restart and re-establishes interface monitoring, instead of silently disabling it for the process lifetime.
b1b2349 to
7133da0
Compare
|



Describe your changes
The WireGuard interface monitor introduced in #4370 spawns a 2 s ticker that calls
net.InterfaceByName(ifaceName)on every tick. On Linux, that function issuessyscall.NetlinkRIB(RTM_GETLINK, ...)and dumps the entire kernel link table on every call, then linear-scans it for the matching name. On hosts with many veth interfaces (Docker, containerd, k8s) the per-call cost is dozens of KB and the ticker generates roughly 1 GB/day of allocation churn from this single source. On long-running clients the GC pressure plus span fragmentation manifests as a slow, monotonic RSS climb.This is one of the persistent leak sources behind #3678.
Smoking gun
Observed on a Raspberry Pi (4 GB, ARM64) running v0.68.1: netbird RSS climbed from 60 MB to 1.84 GB over ~2 days (~920 MB/day), eventually starving every other service on the host.
A pprof
allocs.proffrom a freshly-restarted v0.68.1 daemon (~20 minutes uptime) shows the WG interface monitor as the single largest allocator at 513 MB / 29% of all allocations:Math: 86,400 s/day ÷ 2 s = 43,200 calls/day × ~30 KB per call = ~1.3 GB/day allocated by this one ticker. That matches the observed leak rate.
Fix
Split the watcher by build tag.
wg_iface_monitor_linux.gosubscribes toRTNLGRP_LINKvianetlink.LinkSubscribe(already a transitive dependency throughvishvananda/netlink) and reacts toRTM_DELLINK/RTM_NEWLINKevents for the tracked interface index. Allocations between events drop to zero. The same pattern is already used byclient/internal/networkmonitor/check_change_linux.gofor route events, so the dependency, idiom, and review surface are familiar.wg_iface_monitor_other.gokeeps the original 2 s polling loop for darwin / windows / freebsd / android / ios. No behavior change on those platforms — they do not exhibit theNetlinkRIBcost (darwin usessysctl(NET_RT_IFLIST2), windows usesGetIfTable).wg_iface_monitor.gokeeps the sharedWGIfaceMonitortype, the early-return checks (mobile / netstack / empty name) andgetInterfaceIndex, then dispatches to the platform-specificwatchInterface.A small race window between the initial
getInterfaceIndexcall inStartandLinkSubscribecompleting its handshake is closed by re-checking the index after subscribing.Follow-up
Windows is also reported as affected in #3678 (multi-GB peaks on Server 2016/2019). The same event-driven treatment can be applied there using
NotifyIpInterfaceChangefromiphlpapi. Left as a follow-up so this PR stays focused on the worst offender (Linux) with the smallest possible diff.Issue ticket number and link
Refs: #3678
Original feature: #4370
Checklist
Verification
Cross-compiled clean for
linux/{amd64,arm64},darwin/{amd64,arm64}, andwindows/amd64againstmain. The patched binary has been running on the affected Pi for the last hour with no functional regressions; an extended-duration RSS comparison will be added once a meaningful sample is available.Summary by CodeRabbit